Abstract

A tree-ensemble method, referred to as time series forest (TSF), is proposed 
for time series classification. TSF employs a combination of entropy gain 
and a distance measure, referred to as the Entrance (entropy and distance) 
gain, for evaluating the splits. Experimental studies show that the Entrance 
gain improves the accuracy of TSF. TSF randomly samples features at each 
tree node and has computational complexity linear in the length of time 
series, and can be built using parallel computing techniques. The temporal 
importance curve is proposed to capture the temporal characteristics useful 
for classification. Experimental studies show that TSF using simple features 
such as mean, standard deviation and slope is computationally efficient and 
outperforms strong competitors such as one-nearest-neighbor classifiers with 
dynamic time warping.

1. Introduction



Time series classification has been playing an important role in many 
disciplines such as finance [25] and medicine [2]. Although one can treat the
value of each time point as a feature and use a regular classifier such as one- 
nearest-neighbor (NN) with Euclidean distance for time series classification, 
the classifier may be sensitive to the distortion of the time axis and can lead 
to unsatisfactory accuracy performance. One-nearest-neighbor with dynamic 
time warping (NNDTW) is robust to the distortion of the time axis and 
has proven exceptionally difficult to beat [20]. However, NNDTW provides 
limited insights into the temporal characteristics useful for distinguishing 
time series from different classes. 



The temporal features calculated over time series intervals [15], referred
to as interval features, can capture the temporal characteristics, and can also
handle the distortion in the time axis. For example, in the two-class time
series shown in Figure 1, the time series from one of the classes have sudden
changes between time 201 and time 400 but not at the same time points. An
interval feature such as the standard deviation between time 201 and time 
400 is able to distinguish the two-class time series. 



Previous work [15] has built decision trees on interval features. However,
a large number of interval features can be extracted from time series, and 
there can be a large number of candidate splits to evaluate at each tree 
node. Class-based measures (e.g., entropy gain), which evaluate the ability of 
separating the classes, are commonly used to select the best split in a node. 
However, there can be many splits having the same ability of separating 
the classes. Therefore, measures able to further distinguish these splits are 
desirable. Also, given a large number of features/splits, an efficient and 
accurate classifier that can provide insights into the temporal characteristics 
is valuable. 

To this end, we propose a novel tree-ensemble classifier: time series forest 
(TSF). TSF employs a new measure called the Entrance (entropy and dis- 
tance) gain to identify high-quality splits. We show that TSF using Entrance 
gain outperforms TSF using entropy gain and also two NNDTW algorithms. 
By using a random feature sampling strategy, TSF has computational com- 
plexity linear in the time series length. Furthermore, we propose the tempo- 
ral importance curve to capture the temporal characteristics informative for 
time series classification. 

The remainder of this paper is organized as follows. Section 2 presents the
definition of the problem and related work. Section 3 introduces the interval
features. Section 4 describes the TSF method. Section 5 demonstrates the
effectiveness and efficiency of TSF by experiments. Conclusions are drawn
in Section 6.

Figure 1: The time series from class 2 have sudden changes between time 201 and time
400. An interval feature such as the standard deviation between time 201 and time 400
can distinguish the time series from the two classes.

2. Definition and Related Work 

Given training time series instances (examples) $\{e_1, ..., e_i, ..., e_N\}$ and
the corresponding class labels $\{y_1, ..., y_i, ..., y_N\}$, where $y_i \in \{1, 2, ..., C\}$, the
objective of time series classification is to predict the class labels for test- 
ing instances. Here we assume the values of time series are measured at 
equally-spaced intervals, and also assume the training and testing time series 
instances are of the same length M. 

Time series classification methods can be divided into instance-based and 
feature-based methods. Instance-based classifiers predict a testing instance 
based on its similarity to the training instances. Among instance-based clas- 
sifiers, nearest-neighbor classifiers with Euclidean distance (NNEuclidean) 
or dynamic time warping (NNDTW) have been widely and successfully used.
Usually NNDTW performs better than NNEuclidean (dynamic time warping
is robust to the distortion in the time axis), and is considered a strong
solution for time series problems [13]. Instance-based classifiers can be
accurate, but they provide limited insights into the temporal characteristics
useful for classification.

Feature-based classifiers build models on temporal features, and potentially
can be more interpretable than instance-based classifiers. Feature-based
classifiers commonly consist of two steps: defining the temporal features
and training a classifier based on the temporal features defined. Nanopoulos
et al. extracted statistical features such as the mean and deviation of
an entire time series, and then used a multi-layer perceptron neural network
for classification. This method only captured the global properties of time
series. Local properties, potentially informative for classification, were
ignored. Geurts extracted local temporal properties after discretizing the
time series. Rodriguez et al. [15] boosted binary stumps on temporal features
from intervals of the time series, and Rodriguez and Alonso [14] and Rodriguez
et al. [16] applied classifiers such as a decision tree and an SVM on the
temporal features extracted from the boosted binary stumps. However, only
binary stumps were boosted, and the effect of using more complex base learners,
such as decision trees, should be studied [15] (but larger tree models impact
the computational complexity). Furthermore, in decision trees [15], [14], [16],
class-based measures are often used to evaluate the candidate splits in a
node. However, the number of candidate splits is generally large, and, thus,
there can be multiple splits having the same ability of separating the classes.
Consequently, additional measures able to further distinguish these features
are desirable. Ye and Keogh [23] briefly discussed strategies of introducing
additional measures to break ties, but it was in a different context.

Recently, Ye and Keogh [23] proposed time series shapelets to perform
interpretable time series classification. Shapelets are time series subsequences
which are in some sense maximally representative of a class [23]. Ye and
Keogh [23], Xing et al. [22], and Lines et al. [10] have successfully shown that
time series shapelets can produce highly interpretable results. In terms of
accuracy, Lines et al. [10] showed that the shapelet approach is comparable
to NNDTW for nine data sets investigated.

Eruhimov et al. considered a massive number of features. The feature
sets were derived from statistical moments, wavelets, Chebyshev coefficients,
PCA coefficients, and the original values of time series. The method can
be accurate, but is hard to interpret and computationally expensive. The
objective of our work is to produce an effective and efficient classifier that
uses/yields a set of simple features that can contribute to the domain
knowledge. For example, in manufacturing applications, specific properties of
the time series signals that discriminate conforming from nonconforming
products are invaluable to diagnose, correct, and improve processes.



3. Interval Features 

Interval features are calculated from a time series interval, e.g., "the in- 
terval between time 10 and time 30". Many types of features over a time 
interval can be considered, but one may prefer simple and interpretable fea- 
tures such as the mean and standard deviation, e.g., "the average of the time 
series segment between time 10 and time 30" . 

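Let $K$ be the number of feature types and $f_k(\cdot)$ ($k = 1, 2, ..., K$) be the
$k$-th type. Here we consider three types: $f_1$ = mean, $f_2$ = standard deviation,
$f_3$ = slope. Let $f_k(t_1, t_2)$ for $1 \le t_1 \le t_2 \le M$ denote the $k$-th interval
feature calculated over the interval between $t_1$ and $t_2$. Let $v_i$ be the value
at time $i$ for a time series example. Then the three interval features for the
example are calculated as follows:

$$f_1(t_1, t_2) = \frac{\sum_{i=t_1}^{t_2} v_i}{t_2 - t_1 + 1}$$ (1)

$$f_2(t_1, t_2) = \begin{cases} \sqrt{\frac{\sum_{i=t_1}^{t_2} (v_i - f_1(t_1, t_2))^2}{t_2 - t_1}} & t_2 > t_1 \\ 0 & t_2 = t_1 \end{cases}$$ (2)

$$f_3(t_1, t_2) = \begin{cases} \hat{\beta} & t_2 > t_1 \\ 0 & t_2 = t_1 \end{cases}$$ (3)

where $\hat{\beta}$ is the slope of the least squares regression line of the training set
$\{(t_1, v_{t_1}), (t_1 + 1, v_{t_1 + 1}), ..., (t_2, v_{t_2})\}$.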
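
The three interval features can be illustrated with a short Python sketch. It is not the authors' Matlab/C implementation; the function name `interval_features` is hypothetical, time indices are taken to be 1-based and inclusive, and the standard deviation is assumed to use the sample (n - 1) normalization.

```python
import math


def interval_features(v, t1, t2):
    """Return (mean, std, slope) of v over the 1-based inclusive interval [t1, t2].

    Assumption: the standard deviation uses the sample (n - 1) denominator,
    and both std and slope are defined as 0 when the interval has one point.
    """
    seg = v[t1 - 1:t2]
    n = len(seg)
    mean = sum(seg) / n
    if n == 1:
        return mean, 0.0, 0.0
    std = math.sqrt(sum((x - mean) ** 2 for x in seg) / (n - 1))
    # Least squares slope of the points (t, v_t) for t = t1, ..., t2.
    t_mean = (t1 + t2) / 2
    times = range(t1, t2 + 1)
    sxy = sum((t - t_mean) * (x - mean) for t, x in zip(times, seg))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return mean, std, sxy / sxx


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(interval_features(series, 2, 4))
```

For the linear toy series above, the interval between time 2 and time 4 covers the values 2, 3 and 4, so the mean is 3, the standard deviation is 1 and the slope is 1.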

Interval features have been shown to be effective for time series
classification [15, 14, 16]. However, the interval feature space is large
($O(M^2)$). Rodriguez et al. [15] considered using only intervals of lengths
equal to powers of two, and, therefore, reduced the feature space to
$O(M \log M)$. Here we consider the random sampling strategy used in a random
forest [1] that reduces the feature space to $O(M)$ at each tree node.

4. Time Series Forest Classifier

4.1. Splitting criterion

A time series tree is the base component of a time series forest, and the
splitting criterion is used to determine the best way to split a node in a tree.
A candidate split in a time series tree node tests the following condition
(for simplicity and without loss of generality, we assume the root node here):

$$f_k(t_1, t_2) \le \tau$$ (4)

for a threshold $\tau$. The instances satisfying the condition are sent to the left
child node. Otherwise, the instances are sent to the right child node.

Let $\{f_k^n(t_1, t_2), n \in 1, 2, ..., N\}$ denote the set of values of $f_k(t_1, t_2)$ for all
training instances at the node. To obtain a good threshold $\tau$ in Equation (4),
one can sort the feature values of all the training instances and then select the
best threshold from the midpoints between pairs of consecutive values, but
this can be too costly [14]. We consider the strategy employed in Rodriguez
and Alonso [14]. The candidate thresholds for a particular feature type $f_k$
are formed such that the range
$[\min_{n=1}^{N} f_k^n(t_1, t_2), \max_{n=1}^{N} f_k^n(t_1, t_2)]$ is
divided into equal-width intervals. The number of candidate thresholds is
denoted as $\kappa$ and is fixed, e.g., 20. The best threshold is then selected from
the candidate thresholds. In this manner, sorting is avoided, and only $\kappa$ tests
are needed.
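
A minimal Python sketch of the equal-width threshold strategy follows. The text does not state exactly where the thresholds sit inside the range, so the placement used here (the interior boundaries of kappa + 1 equal-width bins) is an assumption, and the function name `candidate_thresholds` is hypothetical.

```python
def candidate_thresholds(values, kappa=20):
    """Return kappa equally spaced thresholds strictly inside [min, max].

    Assumption: thresholds are the interior boundaries of kappa + 1
    equal-width bins. No sorting of the feature values is required.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All feature values are identical, so no split is possible.
        return []
    step = (hi - lo) / (kappa + 1)
    return [lo + step * i for i in range(1, kappa + 1)]


print(candidate_thresholds([0.0, 3.0, 10.0], kappa=4))
```

Only the minimum and maximum of the feature values are needed, which is what avoids the sort required by the midpoint strategy.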

Furthermore, a splitting criterion is needed to define the best split $S^*$:
$f^*(t_1^*, t_2^*) \le \tau^*$. We employ a combination of entropy gain and a distance
measure as the splitting criterion. Entropy gain is commonly used as the
splitting criterion in tree models. Denote the proportions of instances
corresponding to classes $\{1, 2, ..., C\}$ at a tree node as
$\{\gamma_1, \gamma_2, ..., \gamma_C\}$, respectively.
The entropy at the node is defined as

$$Entropy = -\sum_{c=1}^{C} \gamma_c \log \gamma_c$$ (5)

The entropy gain $\Delta Entropy$ for a split is then the entropy at the parent
node minus the weighted sum of entropy at the child nodes, where the weight
at a child node is the proportion of instances assigned to that child node.

$\Delta Entropy$ evaluates the usefulness of separating the classes. However,
in time series classification, the number of candidate splits can be large,
and there are often cases where multiple candidate splits have the same
$\Delta Entropy$. Therefore we consider an additional measure called $Margin$,
which calculates the distance between a candidate threshold and its nearest
feature value. The $Margin$ of split $f_k(t_1, t_2) \le \tau$ is calculated as

$$Margin = \min_{n=1,2,...,N} |f_k^n(t_1, t_2) - \tau|$$ (6)

where $f_k^n(t_1, t_2)$ is the value of $f_k(t_1, t_2)$ for the $n$-th instance at the node. A
new splitting criterion $E$, referred to as the Entrance (entropy and distance)
gain, is defined as a combination of $\Delta Entropy$ and $Margin$:

$$E = \Delta Entropy + \alpha \cdot Margin$$ (7)

where $\alpha$ is small enough so that the only role for $\alpha$ in the model is to break
ties that can occur from the entropy gain alone. Alternatively, one can store
the values of $\Delta Entropy$ and $Margin$ for a split, and use $Margin$ to break
ties when another split has the same $\Delta Entropy$.
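
The Entrance gain can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the natural logarithm is used for the entropy, and `alpha=1e-8` is an arbitrary small value chosen only so that the margin acts as a tie-breaker.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class proportions in `labels` (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def entrance_gain(values, labels, tau, alpha=1e-8):
    """Return (E, entropy_gain, margin) for the split `value <= tau`.

    Assumption: alpha is an illustrative small constant, not a value
    taken from the paper.
    """
    left = [y for x, y in zip(values, labels) if x <= tau]
    right = [y for x, y in zip(values, labels) if x > tau]
    n = len(labels)
    # Weighted sum of child entropies; empty children contribute nothing.
    children = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    gain = entropy(labels) - children
    margin = min(abs(x - tau) for x in values)
    return gain + alpha * margin, gain, margin


values = [1.0, 2.0, 8.0, 9.0]
labels = [0, 0, 1, 1]
print(entrance_gain(values, labels, tau=3.0))
print(entrance_gain(values, labels, tau=5.0))
```

Both thresholds separate the two classes perfectly and therefore have the same entropy gain, but the threshold at 5 lies farther from its nearest feature value and so receives the larger Entrance gain.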

Clearly, the split with the maximum $E$ should be selected to split the
node. Furthermore, $Margin$ and $E$ are sensitive to the scale of the features,
and we employ the following strategy if different types of features have
different scales. For each feature type $f_k$, select the split with the maximum
Entrance gain. To compare the best splits from different feature types, the
split with the maximum $\Delta Entropy$ is selected. If the best splits from
different feature types have the same maximum $\Delta Entropy$, one of the best
splits is randomly selected.




Figure 2: Here the x-axis represents the value of an interval feature. The figure shows
six instances associated with three classes (blue, red, and green), and three splits ($S_1$,
$S_2$, and $S_3$) producing the same entropy gain. The Entrance gain $E$ is able to select
$S_3$ as the best split.

Figure 2 illustrates the intuition behind the criterion $E$. The figure shows,
in one dimension, six instances from three classes in different symbols/colors.
Three candidate splits $S_1$, $S_2$ and $S_3$ are also shown in the figure. Clearly,
all splits have the same $\Delta Entropy$, but one may prefer $S_3$ because $S_3$ has a
larger margin than $S_1$ and $S_2$. The Entrance gain is able to choose $S_3$ as the
best split.

Algorithm 1 sample() function: randomly samples a set of intervals
$<T_1, T_2>$, where $T_1$ is the set of starting time points of intervals, and $T_2$ is
the set of ending points. The function RandSampNoRep(set, samplesize)
randomly selects samplesize elements from set without replacement.

  $T_1 = \emptyset$, $T_2 = \emptyset$
  $W$ = RandSampNoRep($\{1, ..., M\}$, $\sqrt{M}$)
  for $w$ in set $W$ do
    $T_1$ = RandSampNoRep($\{1, ..., M - w + 1\}$, $\sqrt{M - w + 1}$)
    for $t_1$ in set $T_1$ do
      $T_2 = T_2 \cup (t_1 + w - 1)$
    end for
  end for
  return $<T_1, T_2>$
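
The sampling procedure can be sketched in Python as follows. This is one reading of Algorithm 1 rather than the authors' implementation: the sketch returns explicit (start, end) pairs instead of two separate sets, rounds the square roots down to integers, and uses the hypothetical name `sample_intervals`.

```python
import math
import random


def sample_intervals(M, rng=random):
    """Sample roughly sqrt(M) widths and, per width, roughly sqrt(M - w + 1) starts.

    Returns a list of 1-based inclusive (t1, t2) pairs. Assumptions: square
    roots are rounded down, and at least one element is always drawn.
    """
    n_widths = max(1, int(math.sqrt(M)))
    widths = rng.sample(range(1, M + 1), n_widths)
    intervals = []
    for w in widths:
        n_starts = max(1, int(math.sqrt(M - w + 1)))
        # Valid starting points for width w are 1, ..., M - w + 1.
        for t1 in rng.sample(range(1, M - w + 2), n_starts):
            intervals.append((t1, t1 + w - 1))
    return intervals


pairs = sample_intervals(100, random.Random(0))
print(len(pairs), pairs[:3])
```

Because at most about sqrt(M) widths are combined with at most about sqrt(M) starting points each, the number of sampled intervals is bounded by M.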



Algorithm 2 tree(data): Time series tree. For simplicity of the algorithm,
we assume different types of features are on the same scale so that $E$ can be
compared.

  $<T_1, T_2>$ = sample()
  calculate $Threshold_k$, the set of candidate thresholds for each feature type $k$
  $E^* = 0$, $\Delta Entropy^* = 0$, $t_1^* = 0$, $t_2^* = 0$, $\tau^* = 0$, $f^* = \emptyset$
  for $<t_1, t_2>$ in set $<T_1, T_2>$ do
    for $k$ in 1:$K$ do
      for $\tau$ in $Threshold_k$ do
        calculate $\Delta Entropy$ and $E$ for $f_k(t_1, t_2) \le \tau$
        if $E > E^*$ then
          $E^* = E$, $\Delta Entropy^* = \Delta Entropy$, $t_1^* = t_1$, $t_2^* = t_2$, $\tau^* = \tau$, $f^* = f_k$
        end if
      end for
    end for
  end for
  if $\Delta Entropy^* = 0$ then
    label this node as a leaf and return
  end if
  $data_{left}$ <- time series with $f^*(t_1^*, t_2^*) \le \tau^*$
  $data_{right}$ <- time series with $f^*(t_1^*, t_2^*) > \tau^*$
  tree($data_{left}$)
  tree($data_{right}$)

4.2. Time Series Tree and Time Series Forest

The construction of a time series tree follows a top-down, recursive
strategy similar to standard decision tree algorithms, but uses the Entrance
gain as the splitting criterion. Furthermore, the random sampling strategy
employed in random forest (RF) [1] is considered here. At each node, RF only
tests $\sqrt{p}$ features randomly sampled from the complete feature set consisting
of $p$ features. In each time series tree node, we consider randomly sampling
$O(\sqrt{M})$ interval sizes and $O(\sqrt{M})$ starting positions. Therefore, the feature
space is reduced to only $O(M)$. The sampling algorithm is illustrated in
Algorithm 1.

The time series tree algorithm is shown in Algorithm 2. For simplicity,
we assume different types of features are on the same scale so that $E$ can
be compared. If different types of features have different scales, the previously
mentioned strategy can be used, that is, for each feature type $f_k$, select the
split with the maximum Entrance gain. To compare the best splits from
different feature types, the split with the maximum $\Delta Entropy$ is selected.
Furthermore, a node is labeled as a leaf if there is no improvement on the
entropy gain (e.g., all features have the same value or all instances belong to
the same class).

A time series forest (TSF) is a collection of time series trees. A TSF 
predicts a testing instance to be the majority class according to the votes 
from all time series trees. 
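
The voting step can be sketched in Python as below. The function name `forest_predict` is hypothetical, and how ties between classes are resolved is not specified in the text; this sketch simply returns the tied class that was seen first.

```python
from collections import Counter


def forest_predict(tree_votes):
    """Return the majority class among the per-tree predictions.

    Assumption: ties are broken in favour of the class encountered first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


print(forest_predict([1, 2, 2, 3, 2]))
```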

4.3. Computational Complexity

Let $n_{ij}$ denote the number of instances in the $j$-th node at the $i$-th depth in
a time series tree. At each node, calculating the splitting criterion of a single
interval feature has complexity $O(n_{ij} \kappa)$, where $\kappa$ is the number of candidate
thresholds. As $O(M)$ interval features are randomly selected for evaluation,
the complexity for evaluating the features at a node is $O(n_{ij} M \kappa)$. As $\kappa$ is
considered as a constant, the complexity at a node is $O(n_{ij} M)$.

The total number of instances at each depth is at most $N$ (i.e.,
$\sum_j n_{ij} \le N$). Therefore, at the $i$-th depth in the tree, the complexity is
$O(\sum_j n_{ij} M) \le O(NM)$. Assuming the maximum depth of a tree model is
$O(\log N)$ [19], the complexity of a time series tree becomes $O(MN \log N)$.
Therefore, the complexity of a TSF with $nTree$ time series trees is at most
$O(nTree \cdot MN \log N)$, linear in the length of time series.

4.4. Temporal Importance Curve

TSF consists of multiple trees and is difficult to understand. Here we 
propose the temporal importance curve to provide insights into time series 
classification. At each node of TSF, the entropy gain can be calculated for
the interval feature used for splitting. For a time index in the time series, one
can add the entropy gain of all the splits associated with the time index for
a particular type of feature. That is, for a feature type $f_k$, the importance
score for time index $t$ can be calculated as

$$Imp_k(t) = \sum_{t_1 \le t \le t_2,\ \nu \in SN} \Delta Entropy(f_k(t_1, t_2), \nu)$$ (8)

where $SN$ is the set of split nodes in TSF, and $\Delta Entropy(f_k(t_1, t_2), \nu)$ is the
entropy gain for feature $f_k(t_1, t_2)$ at node $\nu$. Note $\Delta Entropy(f_k(t_1, t_2), \nu) = 0$
if $f_k(t_1, t_2)$ is not used for splitting node $\nu$. Furthermore, one temporal
importance curve is generated for each feature type. Consequently, for the
mean, standard deviation and slope features, we calculate the mean, standard
deviation, and slope temporal importance curves, respectively.
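
The accumulation of entropy gains into an importance curve can be sketched in Python as follows. The representation of a split as a `(t1, t2, gain)` tuple and the function name `importance_curve` are assumptions made for illustration; one such list would be collected per feature type from all split nodes of the forest.

```python
def importance_curve(splits, M):
    """Accumulate entropy gains over time indices for one feature type.

    `splits` holds (t1, t2, gain) tuples with 1-based inclusive intervals,
    one tuple per split node that used this feature type.
    """
    imp = [0.0] * M
    for t1, t2, gain in splits:
        for t in range(t1, t2 + 1):
            imp[t - 1] += gain
    return imp


# Two hypothetical splits on a series of length 6.
print(importance_curve([(2, 4, 0.5), (3, 5, 0.25)], M=6))
```

Time indices 3 and 4 are covered by both intervals and therefore receive the sum of the two gains.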

To investigate the temporal importance curve, we simulated two data sets, 
each with 1000 time points and two classes. For the first data set the time 
series have the same distribution so that no feature is useful for separating 
the classes. The time series values from both classes are normally distributed 
with zero mean and unit variance. The time series and the importance curves
from TSF using Entrance gain are shown in Figure 3(a). It can be seen that
all curves have larger values in the middle.

Note that the number of intervals that include time index $t$ in a time
series is

$$Num(t) = t(M - t + 1)$$ (9)

Consequently, different time indices are associated with different numbers of
intervals. The number of intervals for each time index for time series with
1000 time points is plotted in Figure 3(b). The indices in the middle have
more intervals than the indices on the edges of the time series. Because
$Imp_k(t)$ is calculated by adding the entropy gain of all the splits associated
with time index $t$ for feature $f_k$, it can be biased towards the time points
having more interval features (particularly if no feature is important for
classification).
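
The count of intervals covering a time index can be checked numerically. The Python sketch below compares the closed form t(M - t + 1) with a brute-force enumeration of all intervals; the function names are hypothetical.

```python
def num_intervals(t, M):
    """Closed form: t choices for the start times (M - t + 1) choices for the end."""
    return t * (M - t + 1)


def num_intervals_brute(t, M):
    """Count intervals [t1, t2] with 1 <= t1 <= t <= t2 <= M by enumeration."""
    return sum(
        1
        for t1 in range(1, M + 1)
        for t2 in range(t1, M + 1)
        if t1 <= t <= t2
    )


M = 30
assert all(num_intervals(t, M) == num_intervals_brute(t, M) for t in range(1, M + 1))
print(num_intervals(1, 1000), num_intervals(500, 1000))
```

For a series of 1000 time points, an edge index is covered by 1000 intervals while index 500 is covered by 250500, which is the imbalance that produces the larger curve values in the middle.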

Figure 3: (a) The time series data and the importance curves from TSF. (b) The number
of intervals associated with each time index. The time indices in the middle are
contained in more intervals. When no feature is important for classification, the curves
may be expected to have larger values for the middle indices as there are more intervals
associated with the middle indices.

For the second data set the time series from the two classes have different
means in interval [201, 250], and different standard deviations in interval
[501, 550]. The temporal importance curves from TSF using Entrance gain
are shown in Figure 4(a). The curves for the mean and slope have peaks in
interval [201, 250], and the curve for the standard deviation has a peak in
interval [501, 550]. Therefore, these curves capture the important temporal
characteristics.

Figure 4: (a) The time series and the temporal importance curves obtained from TSF
using Entrance gain. (b) The time series and the temporal importance curves obtained
from TSF using entropy gain. The time series from the two classes differ in the mean in
interval [201, 250], and differ in the standard deviation in interval [501, 550]. The
importance curves from TSF using Entrance gain are able to capture the informative
intervals well. The curves from TSF using entropy gain have peaks in interval [201, 250],
but have long tails.

We also built TSF using entropy gain, and the corresponding temporal
importance curves are shown in Figure 4(b). Although the curves also have
peaks in interval [201, 250], the curves have long tails. Indeed, the entropy
gain is not able to distinguish many interval features. For example, the mean 
feature for interval [201,250], and the mean feature for interval [201,400] have 
the same entropy gain as both can distinguish the two classes of time series. 
However, the mean feature for interval [201,250] has a larger E than the 
mean feature for interval [201,400]. Consequently, TSF using Entrance gain 
is able to capture the temporal characteristics more accurately. 



5. Experiments 

5.1. Experimental Setup 

The main functions of the TSF algorithm were implemented in Matlab, 
while computationally expensive subfunctions such as interval feature calcu- 
lations were written in C. The parameters were set as follows: the number 
of trees = 500, $f(\cdot)$ = {mean, standard deviation, slope}, and the number
of candidate thresholds $\kappa$ = 20. TSF was applied to a set of time series
benchmark data sets [9] summarized in Table 1. The training/testing split
setting is the same as in Keogh et al. [9]. The experiments were run on a
computer with four cores and the TSF algorithm was built in parallel.

The purpose of the experiments is to answer the following questions: (1) 
Does the Entrance gain criterion improve the accuracy performance and how 
is the accuracy performance of TSF compared to other time series classi- 
fiers? (2) Is TSF computationally efficient? (3) Can the temporal impor- 
tance curves provide some insights about the temporal characteristics useful 
for classification? 

5.2. Results 

We investigated the performance of TSF using the Entrance gain criterion
(denoted as TSF) and using the original entropy gain criterion (denoted
as TSF-entropy), respectively. We also considered alternative classifiers for
comparison: random forest [1] applied to the interval features with sizes
power of two (interRF), the 1-nearest-neighbor (NN) classifier with Euclidean
distance (NNEuclidean), the 1-NN Best warping window DTW (DTWBest)
[12] and the 1-NN DTW with no warping window (DTWNoWin) methods
acquired directly from Keogh et al. [9]. DTWBest has a fixed window limiting
the window width and searches for the best window size, while DTWNoWin
does not use such a window.

Data set                      Length  Training  Testing   Classes
50words                       270     450       455       50
Adiac                         176     390       391       37
Beef                          470     30        30        5
CBF                           128     30        900       3
ChlorineConcentration         166     467       3840      3
CinC_ECG_torso                1639    40        1380      4
Coffee                        286     28        28        2
Cricket_X                     300     390       390       12
Cricket_Y                     300     390       390       12
Cricket_Z                     300     390       390       12
DiatomSizeReduction           345     16        306       4
ECG200                        96      100       100       2
ECGFiveDays                   136     23        861       2
FaceAll                       131     560       1690      14
FaceFour                      350     24        88        4
FacesUCR                      131     200       2050      14
Fish                          463     175       175       7
GunPoint                      150     50        150       2
Haptics                       1092    155       308       5
InlineSkate                   1882    100       550       7
ItalyPowerDemand              24      67        1029      2
Lighting2                     637     60        61        2
Lighting7                     319     70        73        7
MALLAT                        1024    55        2345      8
MedicalImages                 99      381       760       10
MoteStrain                    84      20        1252      2
NonInvasiveFatalECG_Thorax1   750     1800      1965      42
NonInvasiveFatalECG_Thorax2   750     1800      1965      42
OliveOil                      570     30        30        4
OSULeaf                       427     200       242       6
SonyAIBORobotSurface          70      20        601       2
SonyAIBORobotSurfaceII        65      27        953       2
StarLightCurves               1024    1000      8236      3
SwedishLeaf                   128     500       625       15
Symbols                       398     25        995       6
Syntheticcontrol              60      300       300       6
Trace                         275     100       100       4
TwoLeadECG                    82      23        1139      2
TwoPatterns                   128     1000      4000      4
uWaveGestureLibrary_X         315     896       3582      8
uWaveGestureLibrary_Y         315     896       3582      8
uWaveGestureLibrary_Z         315     896       3582      8
Wafer                         152     1000      6164      2
WordsSynonyms                 270     267       638       25
Yoga                          426     300       3000      2

Table 1: Summary of the time series data sets: the number of training and testing
instances, the number of classes and the lengths of the time series.

Data set                      TSF     TSF-entropy  interRF  NNEuclidean  DTWBest  DTWNoWin
50words                       0.2659  0.2769       0.2989   0.3690       0.2420   0.3100
Adiac                         0.2302  0.2609       0.2506   0.3890       0.3910   0.3960
Beef                          0.2333  0.3000       0.3000   0.4670       0.4670   0.5000
CBF                           0.0256  0.0389       0.0411   0.1480       0.0040   0.0030
ChlorineConcentration         0.2537  0.2596       0.2273   0.3500       0.3500   0.3520
CinC_ECG_torso                0.0391  0.0688       0.1065   0.1030       0.0700   0.3490
Coffee                        0.0357  0.0714       0.0000   0.2500       0.1790   0.1790
Cricket_X                     0.2897  0.2872       0.3128   0.4260       0.2360   0.2230
Cricket_Y                     0.2000  0.2000       0.2436   0.3560       0.1970   0.2080
Cricket_Z                     0.2436  0.2385       0.2436   0.3800       0.1800   0.2080
DiatomSizeReduction           0.0490  0.1013       0.0980   0.0650       0.0650   0.0330
ECG200                        0.0800  0.0700       0.1700   0.1200       0.1200   0.2300
ECGFiveDays                   0.0557  0.0697       0.1231   0.2030       0.2030   0.2320
FaceAll                       0.2325  0.2314       0.2497   0.2860       0.1920   0.1920
FaceFour                      0.0227  0.0341       0.0568   0.2160       0.1140   0.1700
FacesUCR                      0.1010  0.1088       0.1283   0.2310       0.0880   0.0951
Fish                          0.1543  0.1543       0.1486   0.2170       0.1600   0.1670
GunPoint                      0.0467  0.0467       0.0400   0.0870       0.0870   0.0930
Haptics                       0.5520  0.5649       0.5487   0.6300       0.5880   0.6230
InlineSkate                   0.6818  0.6746       0.6873   0.6580       0.6130   0.6160
ItalyPowerDemand              0.0301  0.0330       0.0321   0.0450       0.0450   0.0500
Lighting2                     0.1803  0.1803       0.2459   0.2460       0.1310   0.1310
Lighting7                     0.2603  0.2603       0.2740   0.4250       0.2880   0.2740
MALLAT                        0.0448  0.0716       0.0644   0.0860       0.0860   0.0660
MedicalImages                 0.2237  0.2316       0.2658   0.3160       0.2530   0.2630
MoteStrain                    0.1190  0.1182       0.0942   0.1210       0.1340   0.1650
NonInvasiveFatalECG_Thorax1   0.0987  0.1033       0.1104   0.1710       0.1850   0.2090
NonInvasiveFatalECG_Thorax2   0.0865  0.0936       0.0875   0.1200       0.1290   0.1350
OliveOil                      0.0667  0.1000       0.1333   0.1330       0.1670   0.1330
OSULeaf                       0.4339  0.4256       0.4587   0.4830       0.3840   0.4090
SonyAIBORobotSurface          0.2330  0.2346       0.2562   0.1410       0.1410   0.1690
SonyAIBORobotSurfaceII        0.1868  0.1773       0.2067   0.3050       0.3050   0.2750
StarLightCurves               0.0357  0.0364       0.0327   0.1510       0.0950   0.0930
SwedishLeaf                   0.1056  0.1088       0.0768   0.2130       0.1570   0.2100
Symbols                       0.1116  0.1206       0.1216   0.1000       0.0620   0.0500
Syntheticcontrol              0.0267  0.0233       0.0167   0.1200       0.0170   0.0070
Trace                         0.0200  0.0000       0.0400   0.2400       0.0100   0.0000
TwoLeadECG                    0.1177  0.1115       0.1773   0.2530       0.1320   0.0960
TwoPatterns                   0.0543  0.0530       0.0153   0.0900       0.0015   0.0000
uWaveGestureLibrary_X         0.2102  0.2127       0.2094   0.2610       0.2270   0.2730
uWaveGestureLibrary_Y         0.2876  0.2881       0.3023   0.3380       0.3010   0.3660
uWaveGestureLibrary_Z         0.2624  0.2669       0.2764   0.3500       0.3220   0.3420
Wafer                         0.0054  0.0047       0.0071   0.0050       0.0050   0.0200
WordsSynonyms                 0.3793  0.3809       0.4138   0.3820       0.2520   0.3510
Yoga                          0.1513  0.1567       0.1380   0.1700       0.1550   0.1640
win/lose/tie                  -       16/28/1      13/32/0  4/41/0       17/28/0  16/29/0
Average rank                  2.48    2.86         3.43     5.04         3.31     3.88
Rank difference               -       0.38         0.96     2.57         0.83     1.40
Wilcoxon                      -       0.007        0.000    0.000        0.065    0.006

Table 2: The error rates of TSF using the splitting criterion: Entrance gain (TSF) or
entropy gain (TSF-entropy), random forest with 500 trees applied to the interval features
with sizes power of two (interRF), 1-NN with Euclidean distance (NNEuclidean), 1-NN
with the best warping window DTW (DTWBest) [12], and 1-NN DTW with no warping
window (DTWNoWin). The win-lose-tie results of each competitor compared to TSF,
the average rank of each classifier, the rank difference and the Wilcoxon signed ranks test
between TSF and each competitor are also calculated. When multiple methods have the
same error rate for a data set, the average rank is used. For example, both DTWBest and
DTWNoWin have the minimum error rate 0.192 for the FaceAll data set, and, thus, the
rank for both is 1.5.

The classification error rates are shown in Table 2. To compare multiple
classifiers to TSF over multiple data sets, we used the procedure for
comparing multiple classifiers with a control over multiple data sets suggested
by Demsar [3], i.e., the Friedman test followed by the Bonferroni-Dunn test
[3] if the Friedman test shows a significant difference between the classifiers.
In our case, the Friedman test shows that there is a significant difference
between the six classifiers at the 0.001 level. Therefore, we proceeded with
the Bonferroni-Dunn test.

For the Bonferroni-Dunn test, the performance of two classifiers is 
different at the $\alpha$ level if their average ranks differ by at least 
the critical difference (CD): 

$$ CD_\alpha = q_\alpha \sqrt{\frac{N_{classifier}(N_{classifier}+1)}{6 N_{data}}} \qquad (10) $$

where $N_{classifier}$ is the number of classifiers in the comparison (six 
classifiers in our experiments), $N_{data}$ is the number of data sets (45 
data sets in our experiments), and $q_\alpha$ is the critical value for the 
two-tailed Bonferroni-Dunn test for multiple classifier comparison with a 
control. Note that $q_{0.05} = 2.576$ and $q_{0.1} = 2.326$ (Table 5(b) in 
Demsar), so according to Equation (10), $CD_{0.05} = 1.016$ and 
$CD_{0.1} = 0.917$. The average rank of each classifier and the difference 
between the average ranks of TSF and each competitor are shown in Table 2. 
According to the rank difference, there is a significant difference between 
TSF and competitors NNEuclidean, DTWNoWin and interRF at the 0.1 level. 
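
The critical difference values quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the standard Bonferroni-Dunn form, CD = q * sqrt(k(k+1) / (6N)), with k classifiers and N data sets; the function name is illustrative only.

```python
import math


def critical_difference(q_alpha, n_classifiers, n_data_sets):
    """Bonferroni-Dunn critical difference: q * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(
        n_classifiers * (n_classifiers + 1) / (6.0 * n_data_sets)
    )


# Six classifiers and 45 data sets, with the critical values quoted in the text.
cd_005 = critical_difference(2.576, 6, 45)
cd_01 = critical_difference(2.326, 6, 45)
print(round(cd_005, 3), round(cd_01, 3))  # 1.016 0.917
```

A competitor differs significantly from the control at a given level when its average rank is at least this far from the control's average rank.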

In addition to the multi-classifier comparison procedure, we also 
considered the Wilcoxon signed ranks test [18], suggested for comparing a 
pair of classifiers, as the resolution of the multi-classifier comparison 
procedure can be too low to distinguish two classifiers with significantly 
different performance but close average ranks. For example, for six 
classifiers and 45 data sets, assume classifier A always ranks first and 
classifier B always ranks second. Although classifier A is always better 
than classifier B, the average ranks of classifier A and classifier B differ 
by only one, which is below the critical difference of 1.016, and therefore 
there is no significant difference between the two classifiers at the 0.05 
level according to the two-tailed Bonferroni-Dunn test. 

The p-values of the Wilcoxon signed ranks tests between TSF and each 
competitor are shown in Table 2. It can be seen that there is a significant 
difference between TSF and all other competitors (TSF-entropy, interRF, 
NNEuclidean, DTWNoWin and DTWBest) at the 0.1 level. 
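
For readers who want to reproduce a pairwise comparison, the following is a minimal sketch of a two-sided Wilcoxon signed ranks test on paired error rates. It makes assumptions that the text does not specify: zero differences are dropped, tied absolute differences share the average rank, and the p-value uses the normal approximation without a tie correction (exact tables are preferable for small samples). The function name and the example error rates are hypothetical.

```python
import math


def wilcoxon_signed_ranks(errors_a, errors_b):
    """Return (T, z, two-sided p) for paired error rates of two classifiers."""
    # Assumption: pairs with identical error rates are discarded.
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences; ties share the average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        shared = (start + end + 2) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(w_plus, w_minus)
    # Normal approximation of the distribution of T under the null hypothesis.
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return t, z, p


# Hypothetical error rates on ten data sets: the competitor is always worse.
control = [0.10] * 10
competitor = [0.10 + 0.01 * k for k in range(1, 11)]
t, z, p = wilcoxon_signed_ranks(competitor, control)
print(t, round(z, 3), p < 0.01)
```

Because the competitor is worse on every data set in this example, the smaller rank sum T is zero and the test rejects the null hypothesis of equal performance at the 0.01 level.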




Figure 5: Plot of the error rate of each data set versus the number of trees in TSF, and the 
average error rate over all data sets versus the number of trees (represented by the thicker 
red line). We want to show the trend, so different data sets are not distinguished. The 
error rates tend to decrease as the number of trees increases, but the change is relatively 
small for most data sets after 100 trees. 

Next consider the robustness of TSF accuracy to the number of trees. 
Figure 5 shows the error rate of each data set versus the number of trees, 
and the average error rate over all data sets versus the number of trees 
(represented by the thicker red line). The error rates tend to decrease as the 
number of trees increases, but the change is relatively small for most data 
sets after 100 trees. 

(c) The temporal importance curves for the GunPoint data. 
(d) The temporal importance curves for the Wafer data. 

Figure 6: The time series and the temporal importance curves (mean, standard deviation 
and slope) for the GunPoint data set and the Wafer data set, respectively. 

The GunPoint and Wafer time series and their corresponding temporal 
importance curves (mean, standard deviation and slope) are shown in 
Figure 6. For the GunPoint time series, the mean temporal importance curve 
captures the characteristic that the two classes have different means in 
interval [60,100]. The standard deviation and slope temporal importance 
curves, respectively, capture the characteristics that the two classes have 
different standard deviations and slopes in the left and right sides of the 
time series. For the Wafer time series, the standard deviation temporal 
importance curve captures the sudden changes of the time series of class 1 
near the 100th point. Consequently, the temporal importance curve is able to 
provide insights into the temporal characteristics useful for distinguishing 
time series from different classes. 
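
One plausible way to accumulate such curves from a fitted forest is sketched below. It is an illustration under stated assumptions rather than the exact definition used in this paper: each split node is assumed to be available as a tuple (feature kind, start index, end index, gain), indices are 0-based and inclusive, and the gain of a node is added to every time point covered by its interval. The function name and the example nodes are hypothetical.

```python
def temporal_importance(split_nodes, series_length, kinds=("mean", "std", "slope")):
    """Accumulate one importance curve per feature kind from recorded split nodes."""
    curves = {kind: [0.0] * series_length for kind in kinds}
    for kind, t1, t2, gain in split_nodes:
        # Credit the split's gain to every time point inside its interval.
        for t in range(t1, t2 + 1):
            curves[kind][t] += gain
    return curves


# Hypothetical split nodes gathered from all trees of a forest.
nodes = [("mean", 2, 4, 0.5), ("mean", 3, 5, 0.25), ("std", 0, 1, 0.4)]
curves = temporal_importance(nodes, series_length=6)
print(curves["mean"])  # [0.0, 0.0, 0.5, 0.75, 0.75, 0.25]
```

Time points covered by many informative intervals accumulate the largest values, which is the kind of peak that the interval [60,100] would produce on a mean curve.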



5.3. Computational Complexity 

First consider the computational complexity of TSF with regard to the 
length of time series. We selected the data sets with more than 1000 time 
points. For each data set, AM of the time points were randomly sampled, 
where M is the length of the time series and A is a multiplier. The 
computational times for different values of A are shown in Figure 7(a). Next 
consider the computational complexity of TSF with regard to the number of 
training instances. Data sets with more than 1000 training instances were 
selected. For each data set, AN of the training instances were randomly 
sampled, where N is the number of training instances. The computational 
times for different values of A are shown in Figure 7(b). It can be seen 
that the computational time tends to be linear both in the time series 
length and in the number of training instances. 

Therefore, TSF is a computationally efficient classifier for time series. 
Furthermore, in the current TSF implementation, the interval features are 
dynamically calculated at each node, as pre-computing the interval features 
would need O(M^2) features to be stored. It should be noted, however, that 
dynamic calculation can lead to repeated calculations of the interval 
features. Therefore, the implementation can be further improved by storing 
the interval features already calculated to avoid repeated calculations. 
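
The suggested improvement can be sketched with a per-series cache, so that an interval is only evaluated the first time a node requests it. This is a hypothetical illustration, not the TSF implementation: the helper name is invented, the standard deviation is assumed to be the sample standard deviation, the slope is taken from a least squares line over the time index, and single-point intervals are assigned zero standard deviation and slope.

```python
import math
from functools import lru_cache


def make_interval_features(series):
    """Return a cached function computing (mean, std, slope) on series[t1..t2]."""
    values = tuple(series)

    @lru_cache(maxsize=None)
    def features(t1, t2):
        # Indices are 0-based and inclusive.
        window = values[t1:t2 + 1]
        n = len(window)
        mean = sum(window) / n
        if n == 1:
            return mean, 0.0, 0.0
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / (n - 1))
        # Slope of the least squares regression line over the time index.
        t_mean = (t1 + t2) / 2.0
        sxx = sum((t - t_mean) ** 2 for t in range(t1, t2 + 1))
        sxy = sum((t - t_mean) * (v - mean)
                  for t, v in zip(range(t1, t2 + 1), window))
        return mean, std, sxy / sxx

    return features


features = make_interval_features([1.0, 2.0, 3.0, 4.0, 10.0])
print(features(0, 3))              # computed on the first request
print(features(0, 3))              # served from the cache
print(features.cache_info().hits)  # 1
```

An unbounded cache can grow toward the O(M^2) storage that dynamic calculation was meant to avoid, so in practice a finite `maxsize` is a reasonable trade-off between memory and repeated work.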



6. Conclusions 

Both high accuracy and interpretability are desirable for classifiers. 
Previous classifiers such as NNDTW can be accurate, but provide limited 
insights into the temporal characteristics. Interval features can be used to 
capture temporal characteristics; however, the huge feature space can result 
in many splits having the same entropy gain. Furthermore, the computational 
complexity becomes a concern when the feature space becomes large. 

(a) The computational time of TSF with regard to the time series length. 
(b) The computational time of TSF with regard to the number of training instances. 

Figure 7: The computational time of TSF with regard to the time series length and the 
number of training instances, respectively. Data sets with more than 1000 time points and 
1000 training instances were selected, respectively. The computational time tends to be 
linear both in the time series length and in the number of training instances. 

Time series forest (TSF) proposed here addresses the challenges by using 
the following two strategies. Firstly, TSF uses a new splitting criterion 
named Entrance gain that combines the entropy gain and a distance measure to 
identify high-quality splits. Experimental studies on 45 benchmark data 
sets show that the Entrance gain improves the accuracy of TSF. Secondly, 
TSF randomly samples O(M) features from O(M^2) features, and thus makes 
the computational complexity linear in the time series length. In addition, 
each tree in TSF is grown independently, and, therefore, modern parallel 
computing techniques can be leveraged to speed up TSF. 

TSF is an ensemble of trees and is not easy to understand. However, we 
propose the temporal importance curve, calculated from TSF, to capture the 
informative interval features. The temporal importance curve enables one to 
identify the important temporal characteristics. 

TSF uses simple summary statistical features, but outperforms widely 
used alternatives. More complex features, such as wavelets, can also be used 
in the framework of TSF, which potentially can further improve the accuracy 
performance, but at the cost of interpretability. 

In summary, TSF is an accurate, efficient time series classifier, and is able 
to provide insights into the temporal characteristics useful for 
distinguishing time series from different classes. We also note that TSF 
assumes that the time series are of the same length. Given a set of time 
series with different lengths, techniques such as dynamic time warping can be 
used to align the time series to the same length. Still, directly handling 
time series with varying lengths would make TSF more convenient to use, and 
future work includes such an extension. 